Systems and methods of epitope binning and antibody profiling

ABSTRACT

Methods for antibody profiling and epitope mapping are provided herein. More particularly, methods for screening and mapping epitopes of candidate antibodies and protein target identification are provided herein.

CROSS-REFERENCE

This application claims the benefit of U.S. Provisional Application No. 62/160,276, filed on May 12, 2015, which is entirely incorporated herein by reference.

BACKGROUND

This invention relates to methods and systems for developing therapeutic affinity reagents. More particularly, the present invention provides methods and systems for epitope mapping and monoclonal antibody profiling.

Antibodies play a central role in the immune system and in modern health care and medical research. They are commonly used as affinity reagents in research and diagnostic applications and have emerged as an important class of therapeutics. In particular, the development of monoclonal antibody (mAb) technology has had a profound impact on medicine. The therapeutic use of first-generation mAbs achieved considerable success in the treatment of major diseases, including cancer and inflammatory, autoimmune, cardiovascular, and infectious diseases. It is estimated that the majority of newly developed drugs will be biologics which include monoclonal antibodies.

The process of developing a monoclonal therapeutic starts with creating hundreds of hybridomas, each producing a different antibody, and screening pools of several monoclonal antibodies to identify monoclonals that recognize different sites on the target protein (epitopes) and that affect the target protein in the desired fashion. Unfortunately, standard methods developed for mapping antibody epitopes, including peptide tiling and phage, bacteria, and mRNA display, require costly synthesis or enrichment steps. At present, no low-cost universal platform for screening monoclonal antibodies exists. Therefore, there remains a need in the art for cost-effective and efficient methods and systems for monoclonal antibody profiling and epitope mapping.

SUMMARY OF THE INVENTION

Disclosed herein are methods for screening and characterizing antibody binding affinity and specificity to an antigen, including monoclonal antibodies. In general, the method comprises the steps of (a) contacting a sample comprising a monoclonal antibody having unknown specificity for an antigen of interest to a plurality of randomly generated peptides immobilized on a support; (b) selecting for peptides of the plurality that bind to the antibody; (c) screening the selected peptides to identify those that bind most strongly to the antibody; (d) deriving peptide sequences for the identified peptides; and (e) identifying among the derived peptide sequences a conserved motif, where the motif corresponds to an epitope of the antigen to which the monoclonal antibody specifically binds. Identifying among the derived peptide sequences can comprise using a search algorithm to search for a conserved motif. The peptide sequences can be from a database of amino acid sequences. The sample can be a hybridoma culture supernatant. The plurality of random-sequence peptides can comprise at least 300,000 random-sequence peptides per 0.5 cm². Screening of the selected peptides for those that bind most strongly to the antibody can comprise an immunofluorescence assay. Identifying a consensus sequence can comprise aligning the sequences of said antigens using a search algorithm. The antibody can be a monoclonal antibody.

In some instances, the method further comprises identifying a protein target of the antibody, where identifying comprises comparing the one or more consensus motifs to an amino acid sequence database by searching the database for proteins that contain sequences homologous to the consensus sequence motif, retrieving those proteins from the database, and verifying that the antibody binds to a protein retrieved from the database search.

Described herein, in some embodiments, are methods of identifying an epitope recognized by an antibody, the method comprising the steps of (a) contacting a sample comprising the antibody to a plurality of peptides immobilized on an array; (b) identifying peptides that bind to the antibody with a K_(d) of less than 10⁻⁷ M; and (c) screening the peptide sequences of the identified peptides for a consensus sequence motif, wherein the motif corresponds to an epitope of the antigen to which the antibody specifically binds. In some embodiments, screening the peptide sequences of the identified peptides for a consensus sequence motif comprises using a search algorithm. In some embodiments, the peptide sequences are from a database of amino acid sequences. In some embodiments, the peptides are randomly generated. In some embodiments, the array comprises at least 10,000 peptide features per 1 cm². In some embodiments, the array comprises at least 300,000 peptide features per 0.5 cm². In some embodiments, the peptides have a length of 1 to 25 amino acids. In some embodiments, identifying peptides that bind to the antibody comprises an immunofluorescence assay. In some embodiments, screening the peptide sequences of the identified peptides for a consensus sequence motif comprises aligning the peptide sequences using a search algorithm. In some embodiments, the number of peptide sequences screened is at least 500. In some embodiments, the antibody is a monoclonal antibody. In some embodiments, the sample is a hybridoma culture supernatant. In some embodiments, the sample is a serum sample. In some embodiments, the serum sample is from a vertebrate. In some embodiments, the serum sample is from a mammal. In some embodiments, the serum sample is from a human. In some embodiments, the serum sample comprises an antibody that recognizes an epitope in an antigen from an infectious organism. In some embodiments, the infectious organism is a pathogen. 
In some embodiments, the infectious organism is selected from the group consisting of viruses, bacteria, and protists. In some embodiments, the pathogen is Borrelia, Bordetella, hepatitis B virus, Plasmodium, Treponema, or dengue virus.

In some embodiments, the method further comprises identifying a protein target of the antibody, comprising (i) searching a protein sequence database for proteins that contain sequences homologous to the consensus sequence motif; (ii) identifying proteins from step (i); and (iii) verifying that the antibody binds to a protein retrieved from the database search. In some embodiments, homologous sequences show at least 80% identity. In some embodiments, the database comprises proteomes from bacteria, viruses, and eukaryotes. In some embodiments, the eukaryotes are protists. In some embodiments, the bacteria, viruses, and protists are pathogenic. In some embodiments, the identified peptide sequences binding to the antibody are hierarchically clustered and aligned. In some embodiments, the method further comprises examining peptides on the array that are not bound to antibody.
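By way of illustration only, the database search of steps (i) and (ii) can be sketched in Python as follows. The sketch assumes that "homologous" is approximated by the fraction of identical residues in an ungapped window of the same length as the motif; the function names, the example motif, and the example protein sequences are hypothetical and are not taken from this disclosure. Step (iii), verifying binding, is experimental and is not represented.

```python
def best_identity(motif, protein):
    """Return (identity, start) for the best ungapped window of len(motif) in protein."""
    m = len(motif)
    best, best_pos = 0.0, -1
    for start in range(len(protein) - m + 1):
        window = protein[start:start + m]
        identity = sum(x == y for x, y in zip(motif, window)) / m
        if identity > best:
            best, best_pos = identity, start
    return best, best_pos


def search_database(motif, proteins, min_identity=0.8):
    """List (name, start, identity) for proteins with a window at or above min_identity."""
    hits = []
    for name, seq in proteins.items():
        identity, pos = best_identity(motif, seq)
        if identity >= min_identity:
            hits.append((name, pos, identity))
    return hits


# Hypothetical motif and protein sequences, for illustration only.
proteins = {
    "prot_1": "MKTDVPDYLLG",
    "prot_2": "MSDAPDYQRT",
    "prot_3": "MGGSSLLKRT",
}
print(search_database("DVPDY", proteins))
```

In practice a gapped, substitution-matrix-based search would likely be preferred over the ungapped identity scan assumed here; the scan is used only because it makes the 80% identity criterion easy to follow.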

Disclosed herein are methods for characterizing the binding specificity of an antibody, the method comprising the steps of: (a) contacting a sample comprising the antibody to a plurality of peptides immobilized on an array; (b) identifying peptides that bind to the antibody with a K_(d) of less than 10⁻⁷ M; (c) identifying peptides on the array that do not bind to the antibody; and (d) clustering and aligning the identified peptides from (b) and (c) to determine the level of specific binding recognized by the antibody. In some embodiments, the antibody is a monoclonal antibody. In some embodiments, the identified peptides are clustered by similarity of the identified peptides in (b) and (c) to the eliciting peptide used to make the monoclonal antibody. In some embodiments, the identified peptides are hierarchically clustered and aligned. In some embodiments, the level of similarity of identified peptides in steps (b) and (c) to the eliciting peptide is indicative of the degree of promiscuity of antibody binding. In some embodiments, the peptide sequences are from a database of amino acid sequences. In some embodiments, the peptides are randomly generated. In some embodiments, the array comprises at least 10,000 peptide features per 1 cm². In some embodiments, the array comprises at least 300,000 peptide features per 0.5 cm². In some embodiments, the peptides have a length of 1 to 25 amino acids. In some embodiments, identifying peptides that bind to the antibody comprises an immunofluorescence assay.

In some embodiments, screening the peptide sequences of the identified peptides binding to an antibody in step (b) further comprises determining a consensus sequence motif by aligning the peptide sequences using a search algorithm. In some embodiments, determining the consensus sequence comprises aligning the identified peptide sequences using a search algorithm. In some embodiments, the number of peptide sequences screened is at least 500. In some embodiments, the sample is a hybridoma culture supernatant. In some embodiments, the sample is a serum sample. In some embodiments, the serum sample is from a vertebrate. In some embodiments, the serum sample is from a mammal. In some embodiments, the serum sample is from a human. In some embodiments, the serum sample comprises an antibody that recognizes an epitope in an antigen from an infectious organism. In some embodiments, the infectious organism is a pathogen. In some embodiments, the infectious organism is selected from the group consisting of viruses, bacteria, and protists. In some embodiments, the pathogen is Borrelia, Bordetella, hepatitis B virus, Plasmodium, Treponema, or dengue virus.

These and other features, aspects, and advantages will become better understood upon consideration of the following detailed description, drawings, and appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

FIG. 1 is a schematic representation of a method of screening monoclonal antibodies as provided herein. A sample comprising a monoclonal antibody is applied to a random peptide library immobilized on an array, whereby the monoclonal antibody will bind to a subset of the peptides on the array. By aligning the sequence of the antibody-bound peptides, a consensus motif can be determined.

FIG. 2A illustrates top binding subsequences and peptides for the indicated monoclonal antibodies. The upper panel shows the top binding subsequences, and the lower panel shows the subsequences for each monoclonal antibody tested.

FIG. 2B illustrates top binding subsequences and peptides for the indicated monoclonal antibodies. The upper panel shows the top binding subsequences, and the lower panel shows the subsequences for each monoclonal antibody tested.

FIG. 2C illustrates top binding subsequences and peptides for the indicated monoclonal antibodies. The upper panel shows the top binding subsequences, and the lower panel shows the subsequences for each monoclonal antibody tested.

FIG. 2D illustrates top binding subsequences and peptides for the indicated monoclonal antibodies. The upper panel shows the top binding subsequences, and the lower panel shows the subsequences for each monoclonal antibody tested.

FIG. 3A illustrates monoclonal antibody motifs and their corresponding epitopes. Panels show the motifs for the indicated monoclonal antibodies after incubation on peptide microarrays and subsequence analysis.

FIG. 3B illustrates monoclonal antibody motifs and their corresponding epitopes. Panels show histograms of binding profiles for each monoclonal antibody tested.

FIG. 4A illustrates sequence representation and predictive versus non-predictive subsequences. The top 25 sequence motifs found for the monoclonal antibodies HA (left) and p53 (right) are shown.

FIG. 4B illustrates sequence representation and predictive versus non-predictive subsequences. The fraction of all possible k-mers present on the array as a function of k-mer length is shown.

FIG. 5A illustrates the top significant subsequences for disease cohorts. Panels show the top 10 most commonly appearing and significant subsequences in serum samples from the indicated disease cohorts.

FIG. 5B illustrates the top significant subsequences for disease cohorts. The pairwise fractional overlap in significant subsequences is shown. BPE, Bordetella pertussis; HNP, Human Normal Pools, a collection of pools of non-disease individuals.

FIG. 6 illustrates motifs found in single patients. The left panel shows a motif found in a single dengue patient that maps to NS3. The right panel shows a motif present in a single Borrelia patient that maps to the OspF protein. FC, fold change between the individual serum sample and a cohort of normal samples; n, number of peptides associated with that subsequence.

FIG. 7 illustrates finding arbitrary pathogen sequences in a pathogen database. Plots show the distribution of hits to pairs of arbitrary sequences of fixed lengths.

FIG. 8A illustrates using significant subsequences to identify an eliciting pathogen. Sample specific significant subsequences from the malaria cohort are shown.

FIG. 8B illustrates using significant subsequences to identify an eliciting pathogen. Protein matches and organism matches from the malaria cohort are shown.

DETAILED DESCRIPTION OF THE INVENTION

Incorporated herein by reference in its entirety is Richer et al., Molecular & Cellular Proteomics 14.1:136-147, 2015.

Disclosed herein are devices, systems and processes for characterizing antibodies, including monoclonal antibodies, to regionally map the epitope to which the antibody binds. Standard methods for such antibody characterization, also known as epitope binning, typically involve surface plasmon resonance (SPR) technology. Using SPR, monoclonal antibody candidates are screened pairwise for binding to a target protein. Other standard methods involve ELISA-based screens but require synthesis of sets of overlapping peptides corresponding to each protein of interest. The systems and methods provided herein are based at least in part on the inventors' discovery that random peptide arrays can be employed for high-throughput monoclonal antibody profiling. Using random peptide arrays, it is possible to identify peptides that bind to proteins (or other macromolecules) for which peptide affinities were previously unknown. Moreover, the methods and systems provided herein make it possible to simultaneously screen numerous monoclonal antibodies for therapeutic drug development.

Methods

Provided herein are methods, systems and devices useful for screening a plurality of antibodies. By characterizing the binding sites of monoclonal antibodies (mAbs) on protein targets, the screening methods provided herein permit the high-throughput analysis and selection of the most suitable candidates for therapeutic monoclonal antibodies and antibody-based modalities.

In one aspect, the method comprises contacting a sample comprising a monoclonal antibody (mAb) to a plurality of random peptides immobilized on a support; identifying peptides recognized by the mAb; identifying amino acid sequences of mAb-bound peptides; and aligning said peptide sequences to identify one or more consensus motifs, where a consensus motif corresponds to an epitope to which the mAb binds. As used herein, the terms “antibody” and “immunoglobulin” refer to polyclonal, monoclonal, chimeric, and single chain antibodies, as well as the products of a Fab or other immunoglobulin expression library. The term “antibody” encompasses intact antibody molecules as well as antigen binding fragments thereof (including F(ab′)₂, Fab, Fab′, Fv, Fc, and Fd fragments). In some cases, the antibody is a chimeric antibody produced by recombinant methods to contain both the variable region of the antibody and an invariant or constant region of a human antibody. In other embodiments, the antibody is humanized by recombinant methods to combine the complementarity determining regions (CDRs) of the antibody with both the constant (C) regions and the framework regions from the variable (V) regions of a human antibody. Monoclonal antibodies, all derived from a single B-cell clone, exhibit specificity for a single epitope (also known as an antigenic determinant). As used herein, the term “monoclonal antibody” includes antibodies produced by an antibody-producing B cell that has been isolated and fused to an immortal hybridoma cell line in order to produce large quantities of identical monoclonal antibodies. Alternatively, monoclonal antibodies can be prepared using antibody engineering methods such as phage display. See, for example, U.S. Pat. Nos. 6,300,064 and 5,969,108; and “Antibody Engineering,” McCafferty et al. (Eds.) (IRL Press 1996).

As used herein, the terms “epitope” and “antigenic determinant” refer to a site on an antigen (e.g., target polypeptide) that is recognized by an immunoglobulin or antibody and to which the immunoglobulin or antibody specifically binds. Epitopes can be linear or conformational. Generally, an epitope includes at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 consecutive or non-consecutive amino acids in a unique spatial conformation. See, e.g., Epitope Mapping Protocols in Methods in Molecular Biology, Vol. 66, G. E. Morris, Ed. (1996). Encompassed by the term “epitope” are simple epitopes comprising only a few contiguous amino acid residues as well as complex epitopes that encompass discontinuous amino acids. In some cases, complex epitopes comprise amino acids separated in the primary sequence but in close proximity in the three-dimensional folded structure of an antigen. As used herein, the terms “specific binding,” “selective binding,” “selectively binds,” and “specifically binds” refer to antibody binding to an epitope on a predetermined antigen. As used herein, the term “antibody affinity” refers to the strength of the interaction between an epitope and the antibody's antigen-binding site (also known as a paratope). Generally, high affinity antibodies bind quickly and more tightly to the antigen and permit greater sensitivity in assays. In some cases, an antibody specifically or selectively binds with an affinity (K_(D)) of approximately less than 10⁻⁷ M, such as approximately less than 10⁻⁸ M, 10⁻⁹ M, or 10⁻¹⁰ M, or lower. The terms “K_(D)” and “K_(d)” are synonymous and refer to the dissociation equilibrium constant of a particular antibody-antigen interaction.

Specific binding can additionally or alternatively be defined as a binding strength (e.g., fluorescence intensity) more than three standard deviations greater than background represented by the mean binding strength of empty control areas in an array (i.e., having no compound, where any binding is nonspecific binding to the support). The range of affinities or avidities of compounds showing specific binding to a monoclonal or other sample can vary by from about 1 to about 4 and often from about 2.5 to about 3.5 orders of magnitude. Avidity is defined as enhanced binding of a component in solution to a surface that includes multiple copies of a compound, such as a peptide, that the solution component has affinity for. In other words, given a compound on a surface that individually has some affinity for a component of a solution, avidity reflects the enhanced apparent affinity that arises when multiple copies of the compound are present on the surface in close proximity. Avidity is distinct from cooperative binding in that the interaction does not involve simultaneous binding of a particular molecule from the solution to multiple molecules of the compound on the surface. Avidity interactions and/or cooperative binding can occur during the association of components of a solution, such as antibodies in blood, with compounds on a surface.
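As a non-limiting illustration, the intensity-based definition of specific binding given above (a signal more than three standard deviations above the mean of empty control areas) can be sketched in Python. The function name, the feature names, and the intensity values are hypothetical, and the use of the sample standard deviation of the empty controls is an assumption, since the text does not specify which estimator is used.

```python
from statistics import mean, stdev


def specific_binders(intensities, empty_controls, n_sd=3.0):
    """Return names of features whose signal exceeds mean + n_sd * SD of empty controls."""
    background = mean(empty_controls)
    # Assumption: sample standard deviation of the empty control areas.
    cutoff = background + n_sd * stdev(empty_controls)
    return [name for name, value in intensities.items() if value > cutoff]


# Hypothetical fluorescence intensities in arbitrary units.
empty = [100.0, 110.0, 90.0, 105.0, 95.0]
signals = {"pep_A": 5000.0, "pep_B": 120.0, "pep_C": 112.0}
print(specific_binders(signals, empty))  # prints ['pep_A']
```

With these example values the cutoff is about 123.7 units, so only the first feature is called a specific binder.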

In exemplary embodiments, a sample comprising an antibody is contacted to a plurality of peptides immobilized on a support. The term “peptide” or “oligopeptide” as used herein refers to organic compounds composed of amino acids, which are arranged in a linear chain and joined together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues. The term “peptide” or “oligopeptide” preferably refers to organic compounds composed of less than 70 amino acid residues, more preferably of less than 35 amino acid residues, more preferably of less than 25 amino acid residues.

Preferably, the plurality of peptides comprises a random peptide library, meaning a library of randomly generated peptide sequences. A random peptide library comprises a plurality of randomly generated peptides without a priori assuming a set of eliciting proteins or proteome. In this manner, a random peptide library provides a universal, non-proteome-specific approach. In some cases, an immobilized random peptide library is a microarray. The term “microarray” as used herein refers to a two dimensional arrangement of features on the surface of a solid or semi-solid support. Features as used herein are defined areas on the microarray comprising biomolecules, such as peptides. The features can be designed in any shape, but preferably the features are designed as squares or rectangles. The features can exhibit any density of biomolecules. Preferably, a microarray comprises a high density of peptide features immobilized on a support. As the number of peptide features increases, the amount of information about the binding character of an antibody of interest increases. Arrays typically have at least 100 compounds. Arrays having between 500 and 500,000 compounds provide a compromise between likelihood of obtaining compounds with detectable binding to any target of interest and ease of synthesis and analysis. Arrays having, for example, 100 to 500,000 members or 500-500,000, or 1000-250,000 members can also be used. Arrays having, for example, between 10,000 and 100,000, between 25,000 and 500,000 or between 50,000 and 350,000 are also contemplated within the disclosures herein. In some cases, peptide density on an array is at least 10,000 peptide features per 1 cm². In other cases, peptide density on an array is at least 300,000 peptide features per 0.5 cm². The density of molecules on an array can be controlled by the attachment or in situ synthesis process by which a compound is attached to a support. 
The length of a coupling cycle and concentration of compound used in coupling can both affect compound density.

The spacing between compounds on an array also can be controlled. The density of different molecules of a compound within an area of an array or on a particle controls the average spacing between molecules of a compound (or compounds in the case of a pooled array), which in turn determines whether a compound is able to form enhanced apparent affinity to a sample (an avidity interaction). If two molecules of a compound or compounds in the case of a pooled array are sufficiently proximate to one another, both molecules can enhance apparent affinity to the same binding partner. In some arrays, at least 10%, 50%, 75%, 90% or 100% of compounds in the array are spaced so as to permit enhanced avidity interactions and/or undergo cooperative binding with a binding partner. However, it is not necessary that all compounds be deposited or synthesized with the same spacing of molecules within an area of the array. For example, in some arrays, some compounds are spaced further apart so as not to permit or permit only reduced avidity interactions or cooperative binding compared with other compounds in an array. Spacing peptides more than 3 nm or 4 nm apart is associated with higher affinity bindings. As spacing decreases (e.g., 1-2 nm) and density increases, lower affinity bindings are more prevalent. In exemplary embodiments, peptide spacing within a feature on the array is less than 3 nm, less than 2 nm or less than 1 nm. For peptides of length 15-25 residues an average (mean) spacing of less than 0.1-6 nm, 1-4 nm, 2-4 nm, e.g., 1, 2 or 3 nm is, for example, suitable to allow different regions of the same compound to undergo binding with enhanced apparent affinity. Average spacings are typically less than 6 nm because spacings of 6 nm or more do not allow avidity to enhance the apparent affinity for the target or cooperative binding to take place. 
For example, for peptides of lengths 15-25 residues, the two identical binding sites of one antibody could not span more than 6 nm to contact two peptides at once and bind cooperatively. The optimum spacing for enhancing avidity and/or cooperativity interactions may vary depending on the compounds used and the components of the sample being analyzed.

Peptide spacing on an array can be measured experimentally under given conditions of deposition by depositing fluorescently labeled compounds and counting photons emitted from an area of an array. The number of photons can be related to the number of molecules of fluorescein in such an area and in turn the number of molecules of compound bearing the label (see, e.g., U.S. Pat. No. 5,143,854). Alternatively, the spacing can be determined by calculation taking into account the number of molecules deposited within an area of an array, coupling efficiency and maximum density of functional groups, if any, to which compounds are being attached. The spacing can also be determined by electron microscopy of an array or via methods sensitive to the composition of molecules on a surface such as x-ray photoelectron spectroscopy or secondary ion mass spectrometry.
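The calculation-based estimate of spacing mentioned above can be illustrated with the following Python sketch. It assumes, purely for illustration, that attached molecules are distributed uniformly on a square lattice, so that the mean spacing is the square root of the area available per molecule. The function name, parameter names, and numerical values are hypothetical.

```python
import math


def mean_spacing_nm(molecules_deposited, area_um2, coupling_efficiency=1.0, max_sites=None):
    """Estimate mean molecule-to-molecule spacing (nm) within a feature.

    Assumes a uniform square lattice; max_sites optionally caps the number of
    attached molecules at the number of available functional groups.
    """
    attached = molecules_deposited * coupling_efficiency
    if max_sites is not None:
        attached = min(attached, max_sites)
    area_nm2 = area_um2 * 1e6  # 1 square micrometre = 1e6 square nanometres
    return math.sqrt(area_nm2 / attached)


# Hypothetical feature: 64 um^2 area, 2e7 molecules deposited, 80% coupling efficiency.
print(round(mean_spacing_nm(2e7, 64.0, 0.8), 2))  # prints 2.0
```

Under these assumed values the estimated spacing is about 2 nm, which falls within the 1-4 nm range described above as suitable for avidity interactions.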

In exemplary embodiments, multiple peptides are immobilized in a pre-selected pattern on a solid support. The term “solid support” as used herein refers to any solid material, having a surface area to which organic molecules (e.g., peptides) can be attached through bond formation or adsorbed through electronic or static interactions such as covalent bond or complex formation through a specific functional group. The solid support comprises any appropriate material such as, for example, glass, silicon, silica, polymeric material, poly(tetrafluoroethylene), poly(vinylidene difluoride), polystyrene, polycarbonate, polymethacrylate, ceramic material, and hydrophilic inorganic material. In some cases, the solid support comprises a hydrophilic inorganic material selected from the group consisting of at least one of alumina, zirconia, titania, and nickel oxide. In other cases, the support comprises a combination of materials such as plastic on glass, carbon on glass, and the like.

Preferably, a solution comprising an antibody of interest is contacted to a plurality of peptides immobilized on a support. As depicted in FIG. 1, the solution in some cases is a supernatant collected from a hybridoma culture. In exemplary embodiments, two or more samples comprising hybridoma supernatants generated from immunization with a target protein of interest are sampled individually on peptide arrays. Data collected from analysis of each array can be used to characterize each sample and to develop a molecular recognition or binding profile for the monoclonal antibody of each hybridoma supernatant.

Binding interactions between components of a sample (e.g., hybridoma supernatant) and an array can be detected in a variety of formats. In some formats, components of the samples are labeled. The label can be a radioisotope or dye, among others. The label can be supplied either by administering the label to a patient before obtaining a sample or by linking the label to the sample or selective component(s) thereof.

In exemplary embodiments, detecting binding of an antibody to one or more peptides of the plurality is performed using any appropriate method including, without limitation, detection using a secondary detection reagent. In some cases, the secondary detection reagent is a fluorescently labeled secondary antibody. In other cases, monoclonal antibodies of the sample (e.g., a supernatant collected from a hybridoma culture) are directly labeled. Binding is then detected with a laser scanner. Alternatively, binding to the array can be directly detected using label-free methods such as surface plasmon resonance (SPR) and mass spectrometry. SPR can provide a measure of dissociation constants and dissociation rates.

Optionally, binding interactions between component(s) of a sample and the array can be detected in a competition format. A difference in the binding profile of an array to a sample in the presence versus absence of a competitive inhibitor of binding can be useful in characterizing the sample. The competitive inhibitor can be, for example, a known protein associated with a disease condition, such as a pathogen or an antibody to a pathogen. A reduction in binding of member(s) of the array to a sample in the presence of such a competitor provides an indication that the pathogen is present. Stringency can be adjusted by varying the salts, ionic strength, organic solvent content and temperature at which library members are contacted with the target.

In exemplary embodiments, the amino acid sequence of each peptide of the plurality on the array is known. In such cases, the amino acid sequences of the peptides to which binding is detected are aligned and compared for shared or consensus amino acid motifs. Such sequence alignments can be performed by any appropriate sequence comparison method. For example, the sequences of peptides binding the antibody can be aligned to the protein antigen sequence using a sequence alignment program that performs extensive computations of user data. For example, BLAST software can be used. In such cases, a given sequence entered by a user is aligned and compared against all sequences in a database containing, for example, all proteins, all proteins with structures, all antibody proteins with structures, or all human antibody proteins with structures. In some cases, antigenic peptide sequences are compared against databases including, without limitation, Swiss-PROT, NBRF/PIR, PRF, and GENPEPT using, for example, Fasta or Smith-Waterman algorithms. Examples of existing databases describing the expressed proteins of various organisms include: UniProt (Universal Protein Resource; uniprot.org on the World Wide Web); Ensembl (ensembl.org on the World Wide Web); VEGA (Vertebrate Genome Annotation; vega.sanger.ac.uk/ on the World Wide Web); CCDS (Consensus CDS; ncbi.nlm.nih.gov/projects/CCDS/ on the World Wide Web); UCSC Genome Browser (genome.ucsc.edu on the World Wide Web); Protein database at NCBI (ncbi.nlm.nih.gov/protein on the World Wide Web); and RCSB Protein Data Bank (pdb.org/ on the World Wide Web).
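For illustration, a Smith-Waterman local alignment of an array peptide against an antigen sequence can be sketched in Python as shown below. This is a simplified sketch rather than the implementation used by any particular alignment program: it assumes a simple match/mismatch score and a linear gap penalty in place of an amino acid substitution matrix, and the function name, scoring values, and example sequences are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Illustrative sequences: an array peptide containing a 9-residue stretch of the antigen.
print(smith_waterman_score("GYPYDVPDYAG", "YPYDVPDYA"))  # prints 18
print(smith_waterman_score("AAAA", "CCCC"))              # prints 0
```

The score of 18 reflects nine consecutive matches at two points each; ranking bound peptides by such a score against a candidate antigen is one simple way to see which region of the antigen they resemble.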

Through sequence alignment, it is possible to identify epitope(s) recognized by the antibody of interest. By examining related sequences on the array that are not bound, the essential amino acids for binding can be defined. By examining the number and diversity of non-binding epitope sites, the promiscuity of the antibody can be quantified.

As used herein, the term “motif” refers to a pattern of residues in an amino acid sequence or nucleotide sequence of a defined length that is conserved or shared among two or more sequences. Consensus motifs identified according to a method provided herein are advantageously contiguous motifs of the genetic sequence and represent a linear sequence of the gene. In some cases, however, motifs identified according to a method provided herein are noncontiguous on the linear sequence of the gene. Deterministic motif finding algorithms useful for the methods provided herein include TEIRESIAS (Rigoutsos and Floratos, Bioinformatics 14:55-67 (1998)) and PRATT (Jonassen, Comput. Appl. Biosci. 13:509-522 (1997)).
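A minimal Python sketch of contiguous motif discovery is given below for illustration. It is not TEIRESIAS or PRATT; it simply counts, for a fixed length k, the contiguous subsequences (k-mers) shared by multiple antibody-bound peptides. The function names and the example peptide sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(peptides, k):
    """Count how many peptides contain each k-mer (once per peptide)."""
    counts = Counter()
    for pep in peptides:
        counts.update({pep[i:i + k] for i in range(len(pep) - k + 1)})
    return counts


def shared_kmers(bound, k, min_peptides=2):
    """Return (k-mer, count) pairs found in at least min_peptides bound peptides."""
    counts = kmer_counts(bound, k)
    return [(kmer, n) for kmer, n in counts.most_common() if n >= min_peptides]


# Hypothetical sequences of antibody-bound peptides.
bound = ["GSYDVPDYAL", "KDVPDYGRS", "AHDVPDYS"]
print(shared_kmers(bound, 5))  # prints [('DVPDY', 3)]
```

Noncontiguous motifs would require a pattern representation with wildcard positions, which this sketch does not attempt.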

The methods provided herein permit simultaneous detection of multiple high- and low-affinity epitopes of multiple monoclonal antibodies. In exemplary embodiments, the methods also identify monoclonal antibodies reactive to and specific for highly conserved epitopes. The ability to detect low-affinity interactions is particularly advantageous for characterizing candidate therapeutic monoclonal antibodies. While therapeutic antibodies often exhibit high affinity toward their antigen, low-affinity off-target binding can affect the pharmacokinetics of a therapeutic mAb.

Array Construction

Peptide microarrays were manufactured using in situ synthesis of 330,000 random-sequence peptides per each 1-cm² region. Each 75 mm×25 mm slide contained 24 subarrays, each containing the 330,000 peptides. The average length of each peptide was 11.2 amino acids with a standard deviation of ±1.3, normally distributed. The longest peptide was 22 amino acids long, and the shortest was 1 amino acid, with 95% of peptides between 8 amino acids and 14 amino acids. Peptides were synthesized from the C terminus to the N terminus, with the amine group farthest from the array surface.

Prior to assay, arrays were washed in 100% N,N-dimethylformamide for one hour and then introduced to an incubation buffer consisting of 3% BSA in PBS with 0.05% Tween 20 over a period of six hours to allow the solvent phase to completely transition to the aqueous phase. The arrays were then processed via incubation in the presence of antibodies or serum and detected by fluorescent antibody (Legutki J. B. et al., Nat. Commun. 5:4785 (2014)).

Binding Antibodies to the Array

Residual N,N-dimethylformamide was removed by two 5-min washes in distilled water. Arrays were equilibrated in PBS for 30 min and blocked in the incubation buffer. Arrays were washed and briefly spun dry prior to being loaded into the 24-well gasket (Array-It, Santa Clara, Calif.). Incubation buffer was added to each well (100 μl), and 100 μl of 1:2500 diluted sera was added for a final dilution of 1:5000. Arrays were incubated for 1 h at 23° C. with rocking and then washed with incubation buffer plus 1% BSA using a BioTek 405TS plate washer (Biotek, Winooski, Vt.). Anti-human IgG-DyLight 549 (KPL, Gaithersburg, Md.) was added to a final concentration of 5.0 nM to detect the human primary IgG. Unbound secondary antibody was then removed by washing in incubation buffer followed by washing in distilled water (5 min each). The arrays were removed from the gasket while submerged, dunked in isopropanol, and centrifuged dry (800×g, 5 min). Arrays were scanned at 533 nm using an Innoscan 910 array scanner (Innopsys, Carbonne, France). Features were aligned and extracted using GenePix Pro 6.0 (Molecular Devices, Sunnyvale, Calif.).

Monoclonal Antibodies

Eight monoclonal antibodies were used in this study: anti-human HA (Rockland Antibodies, Rockland, Md., [YPYDVPDYA]), DM1A (anti-human tubulin, Invitrogen/Invitrogen, [AALEKDYEEVGV]), Ab1 (anti-human TP53 antibodies, Clontech, Palo Alto, Calif., [TFRHSVVV]), FLAG (Invitrogen, Madison, Wis., [DYKDDDDK]), 4C1 (anti-human TSHR, Santa Cruz Biotechnology, Dallas, Tex., [QAFDSHY]), A10 (Acris Antibodies GmbH, Hiddenhausen, Germany, [EEDFRV]), Ab8 (Anti-human P53, Thermo Fisher Scientific, Waltham, Mass., [TFSDLWKLLPE]), and 2C11 (Acris Antibodies GmbH, [NAHYYVFFEEQE]).

Serum Samples

Sera from seven different disease cohorts and 10 pools of healthy persons (designated as Human Normal Pool) were provided by Seracare Life Sciences (Milford, Mass.). An additional control group of 32 different non-infected volunteers was collected from consenting individuals by the Center for Innovations in Medicine at Arizona State University under IRB #0905004024 (renewed April 2014). The eight cohorts used in this study included 32 healthy (Normals), 9 dengue fever (DEN1 Flaviviridae), 8 Lyme disease (Borrelia burgdorferi), 7 syphilis (Treponema pallidum), 13 malaria (Plasmodium falciparum), 12 whooping cough (Bordetella pertussis), 15 hepatitis B virus (Hepadnavirus), and 10 mixed pools of normal subjects (Healthy Normal Pool).

Analytical Methods

Finding Antibody-Specific Peptides

The goal of this study was to find sequence motifs corresponding to an epitope. The first step was to identify peptides that bind specifically to the sample of interest without regard to the peptide sequence. First, arrays were normalized to the median intensity value to account for small differences in serum or dye concentrations. Then, the fold-change was calculated per peptide across the sample of interest (numerator) versus the median of control samples (denominator). The controls for the serum study comprised the 32 healthy volunteers referred to as Normals. The controls for the monoclonal antibody study were a mix of all monoclonal antibodies in this study. For each test, the top 500 peptides were used as seed sequences for epitope discovery.
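The normalization and fold-change steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes that "normalized to the median" means dividing each array by its own median intensity, that all intensities are positive, and the function names (median_normalize, fold_changes, top_seeds) are hypothetical.

```python
from statistics import median


def median_normalize(intensities):
    """Divide every intensity on one array by that array's median (assumed scheme)."""
    m = median(intensities)
    return [x / m for x in intensities]


def fold_changes(sample, controls):
    """Per-peptide fold change: normalized sample over the median of normalized controls."""
    sample_n = median_normalize(sample)
    controls_n = [median_normalize(c) for c in controls]
    result = []
    for i, value in enumerate(sample_n):
        denominator = median(c[i] for c in controls_n)
        result.append(value / denominator)
    return result


def top_seeds(peptides, fc, n=500):
    """Return the n peptides with the largest fold change (the seed sequences)."""
    ranked = sorted(zip(peptides, fc), key=lambda pair: pair[1], reverse=True)
    return [peptide for peptide, _ in ranked[:n]]
```

Here each list holds one intensity per peptide, in the same peptide order for the sample and for every control array.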

Maximal Subsequence Algorithm

The algorithm used to find high binding subsequences was designed to find short consensus motifs within a large set of random peptides. It can be divided into two parts: motif identification and significance testing. Seed sequences are computationally divided into all possible subsequences within a certain range of lengths (three to seven amino acids). The sets of these subsequences S_(x) are ranked and evaluated for significance in subsequent steps. The input to the algorithm is a set of sequences S={s₁, s₂, . . . , s_(n)} and associated preprocessed array intensity values Q={q₁, q₂, . . . , q_(n)}. To find a set of significant subsequences, the sequences in S are divided into all possible subsequences containing between three and seven amino acids each. For example, the sequence AVHAD would be divided into the set {AVH, VHA, HAD, AVHA, VHAD, AVHAD}.
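The enumeration of subsequences can be illustrated with a short sketch. The function name and parameters are hypothetical; only the length range of three to seven residues is taken from the description above.

```python
def subsequences(seq, min_len=3, max_len=7):
    """List every contiguous subsequence of seq with min_len..max_len residues."""
    out = []
    for k in range(min_len, min(max_len, len(seq)) + 1):
        for start in range(len(seq) - k + 1):
            out.append(seq[start:start + k])
    return out
```

Calling subsequences("AVHAD") returns AVH, VHA, HAD, AVHA, VHAD, and AVHAD, the same set given in the example above.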

All the subsequences of the sequences in S constitute a new set, S′. Members of S′ have one or more associated values in Q corresponding to the intensities of the parent sequences containing that subsequence. A function Q_(sub): S′→Q^(m) is defined, where m is the number of peptides, excluding the top 500 seed peptides, that contain the input subsequence. This function returns all intensity values associated with a subsequence.

Sequences s_(i)∈S_(x) are ranked according to their associated values t_(i)=Q_(sub)(s_(i)). A subsequence is considered only if it appears in at least three peptides (|t_(i)|≥3). The number of peptides in which it appears is the support of the subsequence. The ranking function considers the support and the median intensity value median(t_(i)), such that the highest ranked subsequences have at least three appearances on the array and have high median intensities. This criterion is not strictly necessary, but it simplifies significance testing by throwing out non-significant, poorly represented sequences. Once subsequences are filtered and ranked, their significance can be established. This occurs for a given subsequence i using the following nonparametric procedure:

1. Draw |t_(i)| values from Q at random, where |t_(i)| is the support of the subsequence. Call this vector t′_(i).

2. Compute median(t′_(i)).

3. Repeat steps one and two 10,000 times, resulting in a nonparametric estimate of a t_(i) null distribution. Call this vector D.

4. A p value is computed for subsequence s_(i) according to p_(i)=[Σ_(k∈D)I(median(t′_(i))>k)]/|D|, where I is the indicator function.

5. Correct the p values for multiple hypotheses. The following correction function was used: p′_(i)=p_(i)/[Σ_(si∈Sx)|Q_(sub)(s_(i))|]. For example, if 1000 subsequences are considered, α is 1/1000, resulting in one expected false positive.
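The resampling procedure in steps one through four can be sketched as below. This is a hedged illustration under stated assumptions: values are drawn without replacement, and the p value is taken to be the fraction of null medians that are at least as large as the observed median of the subsequence, which is the conventional one-sided reading. The function names and the seeded generator are hypothetical, and the multiple-hypothesis correction of step five is not included.

```python
import random
from statistics import median


def q_sub(subseq, peptides, intensities):
    """Intensities of all peptides whose sequence contains subseq."""
    return [q for p, q in zip(peptides, intensities) if subseq in p]


def permutation_p_value(t_i, all_q, n_draws=10_000, rng=None):
    """Empirical one-sided p value for the median of t_i.

    Assumption: p is the fraction of null medians >= the observed median.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    observed = median(t_i)
    support = len(t_i)
    hits = 0
    for _ in range(n_draws):
        null_draw = rng.sample(all_q, support)  # draw |t_i| values from Q
        if median(null_draw) >= observed:
            hits += 1
    return hits / n_draws
```

In practice the caller would pass only non-seed peptides to q_sub and would skip any subsequence whose support is below three.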

Calling Epitope Candidates

Significant subsequences were identified for each individual per disease cohort. In order to determine the most likely epitope candidates, the sequences were ranked in terms of the number of subjects in which they were called significant. The sequences that appeared most often in different individuals within the same group were deemed the most likely epitope candidates (FIGS. 5A-B).
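The ranking of candidates by the number of subjects in which they were called significant can be sketched as a simple tally. The function name and the input layout (a mapping from a subject identifier to that subject's significant subsequences) are assumptions made for illustration.

```python
from collections import Counter


def rank_candidates(significant_by_subject):
    """Rank subsequences by the number of subjects in which they were significant."""
    counts = Counter()
    for subseqs in significant_by_subject.values():
        counts.update(set(subseqs))  # count each subject at most once per subsequence
    return counts.most_common()
```

The result is a list of (subsequence, subject count) pairs, most frequent first.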

Mapping Epitope Candidates to Pathogen Proteomes

The most common significant subsequences (query sequences) were searched against the pathogen proteome for 100% identity. The probability of a match was assessed by searching randomly drawn array sequences of the same length as the query sequence against the proteome and comparing the expected number of matches to those observed with the query.

Pathogen Identification

The objective was to identify an unknown pathogen based on array sequence information alone. The n significant subsequences from the same cohort were pairwise aligned using the BLOSUM62 substitution matrix, producing an (n×n) matrix of alignment scores. This matrix was hierarchically clustered by single linkage, producing a dendrogram of related subsequences. This analysis revealed peaks of central subsequences that were presumed to be most closely related to the true epitope. These peak sequences were searched against a database of 596 proteomes (hereinafter called the Pathogen Proteome Database) from various strains of pathogenic bacteria, viruses, and protists causing over 100 different diseases. Those proteins and organisms matching all queried sequences with 100% or 80% identity were noted. Probabilities were determined by querying the database with randomly drawn sequences as above.
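The pairwise alignment and single-linkage grouping can be sketched with the standard library alone. Two simplifications are assumptions of this sketch and not part of the described method: residue identity is used as a stand-in for the BLOSUM62 substitution matrix, and instead of building a full dendrogram the code returns the clusters obtained by cutting a single-linkage tree at a chosen score threshold. All names are hypothetical.

```python
def gapless_score(a, b, score=lambda x, y: 1 if x == y else 0):
    """Best score over all gapless offsets of b relative to a.

    The default scorer counts identities; a BLOSUM62 lookup could be passed instead.
    """
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        s = 0
        for j, residue_b in enumerate(b):
            i = offset + j
            if 0 <= i < len(a):
                s += score(a[i], residue_b)
        best = max(best, s)
    return best


def single_linkage(items, min_score):
    """Merge clusters while any cross-cluster pair scores at least min_score."""
    clusters = [[x] for x in items]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(gapless_score(a, b) >= min_score
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Under single linkage, two clusters join as soon as one member of each is similar enough, so overlapping fragments of the same motif chain together into one group.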

Minimum Required Sequence Information

In order to find the point at which pathogen proteins could be resolved from a database given fixed epitope information, several sets of random sequences were generated ranging in length from four to seven amino acids. Pairs of sequences with set lengths were drawn from this set and queried against two databases: one containing 596 human pathogens, and another containing over 5000 bacterial, viral, and eukaryotic proteomes. These two databases helped establish the point at which pathogens could be uniquely resolved. For example, any given trimer sequence would be present in many pathogen proteins, but two heptamer sequences are unlikely to appear in a given pathogen protein by chance.

Sequence Logo Generation

Significant subsequences were collected together into a FASTA-formatted list. Multiple alignments were produced with ClustalW2 (Chaddock A. M., et al., EMBO J. 14, 2715 (1995)). A multiple-alignment text file was used as input to WebLogo3 (Bähler M., Rhoads A, FEBS Lett. 513, 107-113 (2002)) using default settings, producing the motif figure.

E-Value Calculations

The reported E-values were calculated by searching random re-orderings (with replacement) of the candidate subsequence against the target proteome and taking the mean number of occurrences across 10,000 re-orderings as the E-value.
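A minimal sketch of this E-value estimate follows. It assumes that "re-orderings (with replacement)" means resampling the residues of the candidate with replacement to the same length, and that occurrences are counted as exact, possibly overlapping matches. The function names and the representation of a proteome as a list of protein strings are hypothetical.

```python
import random


def count_occurrences(query, proteome):
    """Count exact (possibly overlapping) occurrences of query across all proteins."""
    total = 0
    for protein in proteome:
        start = protein.find(query)
        while start != -1:
            total += 1
            start = protein.find(query, start + 1)
    return total


def shuffle_e_value(candidate, proteome, n_shuffles=10_000, rng=None):
    """Mean number of proteome hits over random re-orderings of the candidate."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    letters = list(candidate)
    total = 0
    for _ in range(n_shuffles):
        # Assumed reading of "with replacement": resample residues to the same length.
        reordered = "".join(rng.choices(letters, k=len(letters)))
        total += count_occurrences(reordered, proteome)
    return total / n_shuffles
```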

The following examples are presented by way of illustration and not limitation.

Example 1 Epitope Determination in Monoclonal Antibodies

Experiments were designed to determine whether one could predictively map epitopes to well-characterized monoclonal antibodies. Eight antibodies with reactivity to a known linear sequence were chosen and analyzed.

Table 1 lists peptides and binding intensities for the eight different monoclonal antibodies. Monoclonal antibodies were used to test the motif search analysis algorithm. The highest rated subsequences were related to the true epitope and to each other to an extent that ensured the emergence of a conserved motif with strong association to the epitope sequence. Ab, antibody; GRAVY, grand average of hydropathicity index (Legutki J. B., Nat. Commun. 5:4785, (2014)).

TABLE 1 Monoclonal antibodies used in this study

| Epitope | Ab name | Immunogen | Isotype | pI | GRAVY | Mean signal intensity | Mapped predictively |
|---|---|---|---|---|---|---|---|
| EEDFRV | A10 | Human Pol II | IgG2b | 4.1 | −1.3 | 4911 | No |
| SDLWKL | p53ab8 | Human p53 | IgG2b, IgG2a | 5.6 | −0.3 | 6243 | No |
| QAFDSH | 4C1 | Human insulin receptor | IgG2a | 5.1 | −1.1 | 971 | Yes |
| RHSVV | p53ab1 | Human p53 | IgG1 | 9.8 | 0 | 5074 | Yes |
| DYKDDDDK | FLAG | FLAG peptide | IgG1 | 4 | −3.3 | 1167 | Yes |
| AALEKD | DM1A | Human tubulin α | IgG1κ | 4.7 | −0.6 | 5798 | Yes |
| YPYDVPDYA | HA | HA peptide | IgG1 | 3.6 | −0.9 | 905 | Yes |
| NAHYYVFFEEQE | 2C11 | Human insulin receptor | IgG1 | 4.5 | −1 | 827 | No |

The linear epitope for each monoclonal antibody was known and was used as the basis for algorithm development and testing as indicated above. In most cases, simply sorting peptides by intensity per monoclonal antibody was insufficient to reveal epitope motifs among the highest binding peptides. Variation in binding to a specific target comes in part from the amount of non-cognate binding. Highly promiscuous antibodies such as anti-HA bind large numbers of peptides with low similarity to the target, and this created a lack of specificity in the datasets (FIGS. 2A-D, Table 2).

The bar plots in FIGS. 2A-D show the top binding peptides (top panel) and subsequences (bottom panel) for each of the eight tested monoclonal antibodies. P53Ab1 (RHSVV), HA (DVPD), 4C1 (FDSH), and FLAG (DYDDDK) each had on-target motifs that were identified within each of the top binding peptides, and these were enhanced through subsequence analysis (shown in red). DM1A (ALEKD) had few on-target motifs in its top peptides, but subsequence analysis revealed the true epitope. FLAG cross-reacted most strongly with the epitope from DM1A (ALEKDY), but subsequence analysis successfully removed this effect.

Table 2 shows the number of peptides for each antibody that yielded a signal greater than 5-fold above background (“total binders”) and how many of those had at least 80% sequence identity with the true epitope (“on target”). See Table 1 for a list of true epitopes. A very low percentage (<11%) of the binding peptides had strong sequence similarity with the true epitope, in agreement with previous studies (Halperin R. F. et al., Mol. Cell. Proteomics 10, 10(3):M110.000786 (2011)).

TABLE 2 On-target versus off-target binding

| Antibody | Total binders | On target | Fraction |
|---|---|---|---|
| AB1 | 42,386 | 466 | 1.10 × 10⁻² |
| HA | 1608 | 53 | 3.30 × 10⁻² |
| 4C1 | 2561 | 276 | 1.08 × 10⁻¹ |
| FLAG | 7563 | 0 | 0 |
| DM1A | 44,821 | 207 | 4.62 × 10⁻³ |
| A10 | 44,924 | 37 | 8.24 × 10⁻⁴ |
| AB8 | 46,327 | 1 | 2.16 × 10⁻⁵ |
| 2C11 | 671 | 0 | 0 |

Thus, transforming the data in terms of peptide subsequences revealed highly specific and consistent motifs that corresponded to epitope targets in five of the tested antibodies. Motifs were similar to the exact eliciting peptide sequence. Even when the exact sequence was not present on the array, sequences very similar to the eliciting peptide predominated (FIGS. 2A-D and FIGS. 3A-B).

FIG. 3A shows the five motifs that were revealed after monoclonal antibodies were incubated on the peptide microarrays and subsequence analysis was performed. Sequence logos were created using the top 10 most highly ranked subsequences obtained from the peptide sequences. Weblogos suggested positional dependence with dominating anchor residues and linking or non-anchor regions. “True epitope” is the sequence determined by the manufacturer. “Inter-alignment” is the expected value of pairwise gapless alignment scores (BLOSUM62 matrix) between any two significant subsequences pulled from the arrays. “Fold change” indicates the relative binding strength of the peptides making up the motif versus the median binding intensity for that peptide in the other monoclonal antibodies tested. Antibodies for which consensus motifs could not be found were A10 (EEDFRV), p53Ab8 (SDLWKL), and 2C11 (NAHYYVFFEEQE). Additional information about these antibodies and their immunogens can be found in Table 1.

FIG. 3B shows histograms of each monoclonal antibody tested. The x-axis is the log₁₀-normalized signal intensity, and the y-axis is the data density. Antibodies demonstrated varied binding profiles, with monoclonals such as HA, 4C1, and FLAG showing a narrow distribution around low intensities, and others such as AB1 and DM1A demonstrating a broader binding profile. See Table 2 for an analysis of on-target versus off-target binding.

As stated above, three of the tested antibodies did not generate a specific response to the expected target sequence. In one of these cases (P53Ab8), the epitope SDLWKL was bound, but because of the high degree of cross-reactivity to non-sequence-similar peptides, one would not expect to map the epitope based on these results alone (FIG. 4A).

FIG. 4A shows the top 25 sequence motifs found for monoclonal antibodies HA (left) and p53Ab8 (right). Red outlined regions indicate the closest match to the actual epitope for the given monoclonal antibody. The black number is the average fold change of the peptides containing the indicated motif relative to the same peptides for all other monoclonal antibodies. For HA (left), although small differences occurred, there is a consensus pattern. In contrast, p53Ab8 (right) demonstrated high overall binding to the true epitope but cross-reacted with many other sequence clusters, preventing good prediction and yielding low fold-change values.

FIG. 4B shows the fraction of all possible k-mers present on the array as a function of k-mer length. The arrays represent 27% of all possible 5-mers redundantly.
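The coverage statistic plotted in FIG. 4B can be computed with a short sketch such as the one below. The function name is hypothetical, and the default alphabet of 20 amino acids is an assumption; an array synthesized from a reduced amino acid set would need a smaller alphabet_size.

```python
def kmer_coverage(peptides, k, alphabet_size=20):
    """Fraction of all possible k-mers that occur in at least one array peptide."""
    seen = set()
    for peptide in peptides:
        for i in range(len(peptide) - k + 1):
            seen.add(peptide[i:i + k])
    return len(seen) / alphabet_size ** k
```

With 20 amino acids there are 20⁵ = 3,200,000 possible 5-mers, so a coverage of 27% corresponds to roughly 864,000 distinct 5-mers on the array.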

Example 2 Groupwise Epitope Determination in Patient Sera

Eight cohorts representing seven different diseases and one group of healthy volunteers were tested using the described methods.

FIG. 5A shows the top 10 most commonly appearing and significant subsequences in serum samples from the indicated disease cohorts. The number of patients within that cohort for which that sequence was called significant is shown in parentheses to the left. The y-axis is categorical and shows each subsequence; the x-axis is the maximum log₁₀-normalized intensity of the peptide binding on the array for each patient. The total number of samples in each cohort is given as a fraction at the top. Subsequences with exact matches to proteins within the pathogen are indicated with vertical red bars. The top ranked sequences are listed in Table 3, which shows discovered epitope sequences and their proposed antigen mappings as described below (Example 3).

TABLE 3 Proposed epitope mappings for disease cohorts

| Sequence | Infection | Organism | Antigen | Known Antigen | In IEDB | Membrane Protein | Putative or Hypothetical | E Value | P Value |
|---|---|---|---|---|---|---|---|---|---|
| AVHAD | Dengue | Dengue virus (1-3) | NS1 | Yes | Yes | N/A | No | 0.0005 | 0.0004 |
| REGEK | Dengue | Dengue virus 4 | Serine protease NS3 | Yes | Yes | N/A | No | 0.00083 | 0.0007 |
| DYAFG | Syphilis | Treponema pallidum | Lipoprotein | No | No | Yes | Yes | 0.27 | 0.26 |
| EDAK | Lyme disease | Borrelia burgdorferi | OspF | Yes | No | Yes | No | 4.6 | 0.98 |
| FKEG | Pertussis | Bordetella pertussis | Multi-drug resistance protein | No | No | Yes | Yes | 3.5 | 0.96 |
| SNKQG, RLKEP | Malaria | Plasmodium falciparum | RESA-like protein | Yes | No | Yes | No | 0.072 | 0.067 |
| DAFEY | Malaria | Plasmodium falciparum | pfEMP1 | Yes | No | Yes | No | 3.5 | 0.96 |

Several of the cohorts performed similarly to the monoclonal antibodies in that they identified a relatively small number of peptides with highly homogeneous sequence motifs that were obvious and visible by simple text matching. These cohorts produced a noticeably homogeneous list of peptide sequences that deviated little from a single and readily apparent motif. The multiple alignments of the top 10 sequences for each of these disease cohorts are shown in FIG. 5B. Of the seven disease cohorts tested, five revealed a clear consensus sequence.

FIG. 5B shows the pairwise fractional overlap in significant subsequences. A colored, saturated cell represents a pair of patients in the same cohort that shared at least 50% of their significant subsequences. Grayscale cells represent pairs of patients from different cohorts whose immune systems see similar sequences. Individuals within the same disease cohort showed much more overlap between their significant subsequences than those in different cohorts or the normal cohort, indicating an association between the discovered sequences and the disease state.
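The pairwise overlap underlying FIG. 5B can be sketched as follows. The exact overlap measure is not specified above, so this sketch assumes the shared count divided by the size of the smaller set; the function names and the input layout (patient identifier mapped to that patient's significant subsequences) are hypothetical.

```python
def fractional_overlap(a, b):
    """Shared subsequences divided by the smaller set size (assumed definition)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def overlap_matrix(subseqs_by_patient):
    """Overlap for every ordered pair of patients, keyed by (patient, patient)."""
    ids = list(subseqs_by_patient)
    return {
        (p, q): fractional_overlap(subseqs_by_patient[p], subseqs_by_patient[q])
        for p in ids
        for q in ids
    }
```

A pair would be drawn as a saturated cell when its value is at least 0.5.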

Example 3 Identification of Consensus Sequences in Pathogen Proteomes

In order to test whether the groupwise consensus motifs (FIGS. 4A-B) corresponded with true epitopes, the Immune Epitope Database was searched for exact substring matches to sequences from these lists. Despite the small size of this database, the sequence AVHAD from dengue was present in the database and indicated as an epitope from the NS1 protein in two dengue strains (E-value: 5×10⁻⁴).

Further analysis of the other cohorts revealed additional matches to antigenic proteins. The sequence EDAK from Borrelia mapped to known antigen OspF (E-value: 4.6), and DYAFG from syphilis mapped to a lipoprotein in several strains of T. pallidum (E-value: 0.27). Malaria contained sequences SNKQG and RLKEP (FIG. 7), both of which mapped to the ring-infected erythrocyte surface antigen (RESA) protein in P. falciparum 3D7 (E-value: 0.072), and another sequence (DAFEY) mapping to one of the pfEMP1 variants in P. falciparum (E-value: 3.5). The sequence FKEG mapped to an MDR efflux protein in B. pertussis (E-value: 3.5). These results are summarized in Table 3 above.

The two dengue epitopes shown in Table 3 were previously verified using peptide tiling of the NS1 and NS3 proteins against dengue sera. Another two (EDAK, DAFEY) map to known and characterized antigens in Borrelia burgdorferi and Plasmodium falciparum, respectively. The remainder displayed motif conservation consistent with epitopes but mapped to hypothetical proteins. “E-value” refers to the expected number of matches to the presumed epitope sequence(s) within the proteome of interest; “p value” refers to the chance of encountering at least one instance of the sequence within the proteome of interest. Not all proposed epitopes mapped to the proteome with significant p values, but they are reported here as a “best guess” to explain the high response to these sequences on the arrays.

These sequences were short as a result of platform limitations, and the E-values for these matches varied based on the size of the proteome. The dengue sequences are unlikely to arise by chance, at least given the size of the initial peptide library, with E-values <10⁻³. Likewise, the two matches to the RESA protein in P. falciparum together had a low E-value of 0.072 corresponding to a p value of 0.067 (see Table 4).

TABLE 4 Sensitivity and specificity of epitope candidates

| Sequence | Infection | Sensitivity | Specificity |
|---|---|---|---|
| AVHAD | Dengue | 1 | 1 |
| REGEK | Dengue | N/A | N/A |
| DYAFG | Syphilis | 1 | 1 |
| EDAK | Lyme disease | 0.125 | 1 |
| FKEG | Pertussis | 0.83 | 1 |
| SNKQG, RLKEP | Malaria | 0.69 | 1 |
| DAFEY | Malaria | 0.46 | 1 |

Because the selection algorithm maximizes sensitivity, these values might not be reliable estimates of performance. However, the candidates shown in Table 4 do map to antigenic proteins and are specific to the cohort of interest. Estimates for the REGEK sequence from dengue could not be computed, either because it was discovered using a separate set of arrays or because too few samples were processed.

Example 4 Individual Epitope Determination in Patient Sera

In order to test the heterogeneity within disease groups, the question was asked whether subsequences were differentially bound between individuals in disease cohorts and normal subjects. It was found that epitope sequences revealed in the groupwise analysis were present in most of the individuals from that group. All nine dengue samples contained AVHAD as a significant subsequence. To visualize the extent of this overlap, the pairwise overlap of significant subsequences was calculated between individuals across disease groups (FIG. 5B).

The feature selection process for the seed peptides requires that antibodies be commonly expressed within a disease cohort. Thus, the antibodies analyzed here displayed highly similar sequences across all individuals within a cohort. These sequences were equally unlikely to appear in other disease groups, also because of the feature selection requirements. However, it should be noted that peptides (features) common within a cohort demonstrated qualitatively greater fold-changes relative to Normals than those with less common sequences within a cohort.

Example 5 Additional Library Complexity Reveals Additional Epitopes

This assay relies on many simultaneous measurements of antibody/peptide interactions. It is useful to know how changes in library content affect results. As only 27% of pentamers were represented on the original arrays, it was hypothesized that a different random library would result in additional targets that were invisible to the original experiments because they were not present. To test this, another array was created with a different set of 330,000 sequences. An attempt was made to find epitopes using a dengue-infected serum sample. This analysis revealed an additional epitope (REGEK, Dengue 4, E-value: 8.3×10⁻⁴) that was previously mapped in the Immune Epitope Database but not present on the original array (FIG. 6).

The motifs shown in FIG. 6 were associated with single patients within a disease cohort. The motif on the left was found in a single dengue patient and maps to NS3 (Garcia G. V. D., Del Angel R. M., Am. J. Trop. Med. Hyg. 56, 466-470 (1997)). It is a mapped epitope and was observable on the random-sequence peptide microarrays. The motif on the right was present in a single Borrelia patient and maps to the OspF protein, known to be associated with an immune response in dogs (Wagner B. et al., Clin. Vaccine Immunol. 19, 527-535 (2012)).

This result suggests that larger arrays should reveal additional epitopes. This experiment did not address specificity, however, and might not be the final argument supporting larger peptide libraries. In order to properly address that question, the second 330,000-peptide library would have to be added to the first and 660,000 peptides would have to be exposed to the sera simultaneously.

Example 6 Mapping Epitope Information to a Database

Having demonstrated that peptide microarrays are capable of resolving epitopes, experiments were designed to determine whether these sequences could predict the eliciting protein from a database of pathogen protein sequences.

Resolving a pathogen in a database given a few short sequences depends on both the size of the database and the length of the consensus motif. It was predicted that when one is using pairs of randomly generated sequences of varying lengths, a pair of pentamers, if known exactly, or a pair of heptamers, if known within 80% identity, is sufficient for resolving a pathogen in the Pathogen Proteome Database (FIG. 7).

As shown in FIG. 7, pairs of k-mers with specified lengths were drawn at random from the distribution associated with array sequences. These were searched against two databases, one containing over 4000 bacteria and viruses (top), and another containing 596 human pathogens (bottom). The plots suggest that when two 7-mer linear epitopes from the same protein antigen are known with at least 80% identity, unique pathogen identification is reliably predicted.
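The database query described here, in which a protein must contain every k-mer of a pair at a given identity level, can be sketched as below. The names, the dictionary layout of the database, and the choice of ungapped windows for the identity calculation are assumptions made for illustration.

```python
import math


def window_matches(query, protein, min_identity=1.0):
    """True if some ungapped window of protein matches query at >= min_identity."""
    k = len(query)
    needed = math.ceil(min_identity * k - 1e-9)  # e.g. 4 of 5 residues at 80%
    for i in range(len(protein) - k + 1):
        same = sum(1 for a, b in zip(query, protein[i:i + k]) if a == b)
        if same >= needed:
            return True
    return False


def proteins_matching_all(queries, proteome, min_identity=1.0):
    """Names of proteins that contain every query at the given identity level."""
    return [
        name
        for name, seq in proteome.items()
        if all(window_matches(q, seq, min_identity) for q in queries)
    ]
```

A pathogen is uniquely resolved when the returned list contains proteins from only one organism; repeating the query with randomly drawn k-mer pairs gives the chance expectation.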

Example 7 Deciphering Eliciting Pathogen Proteins

To improve sensitivity, a restrictive search was chosen, relying on exact or near-exact (80%) identity and matches in the same protein to multiple pentamer queries. Using significant subsequences from malaria subjects, three epitope candidates (SNKQG, RLKEP, SNKQG) were found. Searching these candidates against the Pathogen Proteome Database (multiple strains of each pathogen) resulted in uniquely identified membrane proteins from P. falciparum matching all three query sequences with 80% identity (FIGS. 8A-B). Two of the query sequences matched with 100% identity to a RESA-like protein, a known antigen in Plasmodium infections. The probability of two randomly drawn pentamers matching to one or more proteins globally in this database of over 1 million sequences is <0.01.

As shown in FIGS. 8A-B, sample-specific significant subsequences from the malaria cohort were combined, aligned, and hierarchically clustered by single linkage. This revealed three distinct epitope candidates, indicated by red asterisks. These three sequences were queried against a database of 596 human pathogens for exact and 80% identity. Only one protein from P. falciparum out of all human pathogens contained both RLKEP and SNKQG. The probability of two array 5-mers hitting the same protein by chance is <0.001.

Although the embodiments are described in considerable detail with reference to certain methods and materials, one skilled in the art will appreciate that the disclosure herein can be practiced by other than the described embodiments, which have been presented for purposes of illustration and not of limitation. Therefore, the scope of the appended claims should not be limited to the description of the embodiments contained herein. 

We claim:
 1. A method of identifying an epitope recognized by an antibody, the method comprising the steps of: (a) contacting a sample comprising the antibody to a plurality of peptides immobilized on an array; (b) identifying peptides that bind to the antibody with a K_(d) of less than 10⁻⁷ M; and (c) screening the peptide sequences of the identified peptides for a consensus sequence motif, wherein the motif corresponds to an epitope of the antigen to which the antibody specifically binds.
 2. The method of claim 1, wherein screening the peptide sequences of the identified peptides for a consensus sequence motif comprises using a search algorithm.
 3. The method of claim 1, wherein the peptide sequences are from a database of amino acid sequences.
 4. The method of claim 3, wherein the peptides are randomly generated.
 5. The method of claim 1, wherein the array comprises at least 10,000 peptide features per 1 cm².
 6. The method of claim 1, wherein the array comprises at least 300,000 peptide features per 0.5 cm².
 7. The method of claim 1, wherein the peptides have a length of 1 to 25 amino acids.
 8. The method of claim 1, wherein identifying peptides that bind to the antibody comprises an immunofluorescence assay.
 9. The method of claim 1, wherein screening the peptide sequences of the identified peptides for a consensus sequence motif comprises aligning the peptide sequences using a search algorithm.
 10. The method of claim 1, wherein the number of peptide sequences screened is at least 500.
 11. The method of claim 1, wherein the antibody is a monoclonal antibody.
 12. The method of claim 1, wherein the sample is a hybridoma culture supernatant.
 13. The method of claim 1, wherein the sample is a serum sample.
 14. The method of claim 13, wherein the serum sample is from a vertebrate.
 15. The method of claim 13, wherein the serum sample is from a mammal.
 16. The method of claim 13, wherein the serum sample is from a human.
 17. The method of claim 13, wherein the serum sample comprises an antibody that recognizes an epitope in an antigen from an infectious organism.
 18. The method of claim 17, wherein the infectious organism is a pathogen.
 19. The method of claim 17, wherein the infectious organism is selected from the group consisting of viruses, bacteria, and protists.
 20. The method of claim 18, wherein the pathogen is Borrelia, Bordetella, hepatitis B virus, Plasmodium, Treponema, or dengue virus.
 21. The method of claim 1, further comprising identifying a protein target of the antibody, comprising (i) searching a protein sequence database for proteins that contain sequences homologous to the consensus sequence motif; (ii) identifying proteins from step (i); and (iii) verifying that the antibody binds to a protein retrieved from the database search.
 22. The method of claim 21, wherein homologous sequences show at least 80% identity.
 23. The method of claim 21, wherein the database comprises proteomes from bacteria, viruses, and eukaryotes.
 24. The method of claim 23, wherein the eukaryotes are protists.
 25. The method of claim 23, wherein the bacteria, viruses, and eukaryotes are pathogenic.
 26. The method of claim 1, wherein the identified peptide sequences binding to the antibody are hierarchically clustered and aligned.
 27. The method of claim 1, further comprising examining peptides on the array that are not bound to antibody.
 28. A method of characterizing the binding specificity of an antibody, the method comprising the steps of: (a) contacting a sample comprising the antibody to a plurality of peptides immobilized on an array; (b) identifying peptides that bind to the antibody with a K_d of less than 10⁻⁷ M; (c) identifying peptides on the array that do not bind to the antibody; and (d) clustering and aligning the identified peptides from (b) and (c) to determine the level of specific binding recognized by the antibody.
 29. The method of claim 28, wherein the antibody is a monoclonal antibody.
 30. The method of claim 29, wherein the identified peptides are clustered by similarity of the identified peptides in (b) and (c) to the eliciting peptide used to make the monoclonal antibody.
 31. The method of claim 30, wherein the identified peptides are hierarchically clustered and aligned.
 32. The method of claim 30, wherein the level of similarity of identified peptides in steps (b) and (c) to the eliciting peptide is indicative of the degree of promiscuity of antibody binding.
 33. The method of claim 28, wherein the peptide sequences are from a database of amino acid sequences.
 34. The method of claim 33, wherein the peptides are randomly generated.
 35. The method of claim 28, wherein the array comprises at least 10,000 peptide features per 1 cm².
 36. The method of claim 28, wherein the array comprises at least 300,000 peptide features per 0.5 cm².
 37. The method of claim 28, wherein the peptides have a length of 1 to 25 amino acids.
 38. The method of claim 28, wherein identifying peptides that bind to the antibody comprises an immunofluorescence assay.
 39. The method of claim 28, wherein screening the peptide sequences of the identified peptides binding to the antibody in step (b) further comprises determining a consensus sequence motif.
 40. The method of claim 39, wherein determining the consensus sequence comprises aligning the identified peptide sequences using a search algorithm.
 41. The method of claim 28, wherein the number of peptide sequences screened is at least 500.
 42. The method of claim 28, wherein the sample is a hybridoma culture supernatant.
 43. The method of claim 28, wherein the sample is a serum sample.
 44. The method of claim 43, wherein the serum sample is from a vertebrate.
 45. The method of claim 43, wherein the serum sample is from a mammal.
 46. The method of claim 43, wherein the serum sample is from a human.
 47. The method of claim 43, wherein the serum sample comprises an antibody that recognizes an epitope in an antigen from an infectious organism.
 48. The method of claim 47, wherein the infectious organism is a pathogen.
 49. The method of claim 47, wherein the infectious organism is selected from the group consisting of viruses, bacteria, and protists.
 50. The method of claim 48, wherein the pathogen is Borrelia, Bordetella, hepatitis B virus, Plasmodium, Treponema, or dengue virus. 